Version Twenty Seven, Archive Comparison Table [30 December 1995]
[ACT\CALGARY.SET]
The following files were used in the Calgary/Canterbury text compression
corpus test. For more details see below.
Name          Size  Description
---------------------------------------------------------------------------
BIB        111,261  Bibliographic files (refer format)
BOOK1      768,771  Hardy: Far from the madding crowd
BOOK2      610,856  Witten: Principles of computer speech
GEO        102,400  Geophysical data
NEWS       377,109  News batch file
OBJ1        21,504  Compiled code for Vax: compilation of progp
OBJ2       246,814  Compiled code for Apple Macintosh: Knowledge support
                    system
PAPER1      53,161  Witten, Neal and Cleary: Arithmetic coding for data
                    compression
PAPER2      82,199  Witten: Computer (in)security
PAPER3      46,526  Witten: In search of "autonomy"
PAPER4      13,286  Cleary: Programming by example revisited
PAPER5      11,954  Cleary: A logical implementation of arithmetic
PAPER6      38,105  Cleary: Compact hash tables using bidirectional
                    linear probing
PIC        513,216  Picture number 5 from the CCITT Facsimile test files
                    (text + drawings)
PROGC       39,611  C source code: compress version 4.0
PROGL       71,646  Lisp source code: system software
PROGP       49,379  Pascal source code: prediction by partial matching
                    evaluation program
TRANS       93,695  Transcript of a session on a terminal
---------------------------------------------------------------------------
18 files, 3,251,493 bytes in total. On disk they occupy 3,325,952 bytes
because of file slack (about 2%).
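
The totals above can be checked mechanically. The sketch below sums the
sizes from the table and rounds each file up to a whole allocation unit.
The 8,192-byte cluster size is an assumption (the source does not state
it); it is used here because it reproduces the on-disk figure quoted above.
The names SIZES, CLUSTER and allocated are illustrative only.

```python
# File sizes in bytes, copied from the table above.
SIZES = {
    "BIB": 111_261, "BOOK1": 768_771, "BOOK2": 610_856, "GEO": 102_400,
    "NEWS": 377_109, "OBJ1": 21_504, "OBJ2": 246_814, "PAPER1": 53_161,
    "PAPER2": 82_199, "PAPER3": 46_526, "PAPER4": 13_286, "PAPER5": 11_954,
    "PAPER6": 38_105, "PIC": 513_216, "PROGC": 39_611, "PROGL": 71_646,
    "PROGP": 49_379, "TRANS": 93_695,
}

CLUSTER = 8192  # ASSUMPTION: allocation unit in bytes, not given in the source


def allocated(size, cluster=CLUSTER):
    """Bytes occupied when space is handed out in whole clusters."""
    return -(-size // cluster) * cluster  # integer ceiling division


total = sum(SIZES.values())
on_disk = sum(allocated(s) for s in SIZES.values())
slack = on_disk - total

print(f"{len(SIZES)} files, {total:,} bytes in total")
print(f"{on_disk:,} bytes on disk, slack {slack:,} bytes "
      f"({100 * slack / total:.1f}%)")
```

With a different cluster size the on-disk figure changes while the total
stays the same, which is why the slack percentage depends on the disk the
files were stored on.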
*** More Details ***
This corpus is used in the book

    Bell, T.C., Cleary, J.G. and Witten, I.H. Text compression.
    Prentice Hall, Englewood Cliffs, NJ, 1990

and in the survey paper

    Bell, T.C., Witten, I.H. and Cleary, J.G. "Modeling for text
    compression," Computing Surveys 21(4): 557-591; December 1989,

to evaluate the practical performance of various text compression schemes.
Several other researchers are now using the corpus to evaluate text
compression schemes.
Nine different types of text are represented, and to confirm that the
performance of schemes is consistent for any given type, many of the types
have more than one representative. Normal English, both fiction and
non-fiction, is represented by two books and six papers (labeled book1,
book2, paper1, paper2, paper3, paper4, paper5, paper6). More unusual styles of
English writing are found in a bibliography (bib) and a batch of unedited
news articles (news). Three computer programs represent artificial languages
(progc, progl, progp). A transcript of a terminal session (trans) is
included to indicate the increase in speed that could be achieved by
applying compression to a slow line to a terminal. All of the files
mentioned so far use ASCII encoding. Some non-ASCII files are also
included: two files of executable code (obj1, obj2), some geophysical data
(geo), and a bit-map black and white picture (pic). The file geo is
particularly difficult to compress because it contains a wide range of data
values, while the file pic is highly compressible because of large amounts
of white space in the picture, represented by long runs of zeros.
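
The contrast between pic and geo can be illustrated with synthetic data.
The sketch below is not run on the corpus files: it compares a long run of
zero bytes (a stand-in for white space in a bit-map) with bytes spread
evenly over the whole value range (a deliberately extreme stand-in for
data with a wide range of values). zlib is used only as a convenient
general-purpose compressor; the sample size, the seed and the helper name
ratio are arbitrary choices for this example.

```python
import random
import zlib

N = 100_000  # arbitrary sample size in bytes

# Long run of zeros, like large white areas in a black and white bit-map.
zeros = bytes(N)

# Values spread over 0..255; fixed seed so the run is repeatable.
rng = random.Random(1995)
noisy = bytes(rng.getrandbits(8) for _ in range(N))


def ratio(data):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)


zeros_ratio = ratio(zeros)
noisy_ratio = ratio(noisy)

print(f"run of zeros: compressed to {100 * zeros_ratio:.2f}% of original")
print(f"wide-range:   compressed to {100 * noisy_ratio:.2f}% of original")
```

The real geo file is not random noise, so it compresses better than the
second sample here; the sketch only shows the direction of the effect.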
More details of the individual texts are given in the book mentioned above.
Both the book and the paper give the results of compression experiments on
these texts.
The corpus itself consists of the files bib, book1, book2, geo, news, obj1,
obj2, paper1, paper2, paper3, paper4, paper5, paper6, pic, progc, progl,
progp and trans. (The book and paper above do not give results for files
paper3, paper4, paper5 or paper6.)
The directory "index" contains the sizes of the files and some information
about where they came from.
Ian H. Witten                       Timothy C. Bell
Computer Science Department         Computer Science Department
University of Calgary               University of Canterbury
Calgary T2N 1N4, Canada             Christchurch 1, New Zealand
Phone (403) 220-6780                Phone (64-3) 642352
email: ian@cpsc.UCalgary.CA         email: tim@cosc.canterbury.ac.nz